[BugFix] Separate prometheus multiproc dir for single-server multi-dp services #8059

Open
liyonghua0910 wants to merge 4 commits into PaddlePaddle:develop from liyonghua0910:develop+20260616_fix_dp_metrics
@liyonghua0910 liyonghua0910 commented Jun 16, 2026

Collaborator

Motivation

Fix metric interference when multiple data-parallel services run on one server by isolating Prometheus multiprocess files per DP rank.

Modifications

  • Track the original PROMETHEUS_MULTIPROC_DIR set during metrics initialization.
  • Add setup_dp_prometheus_dir() to create per-DP dp{i} subdirectories and switch the target environment.
  • Apply DP-specific Prometheus dirs when launching internal-adapter DP services from LLMEngine / EngineService and when multi_api_server starts per-DP API server processes.
  • Update unit tests for multi API server Prometheus dirs and Prometheus setup behavior.
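
The per-DP directory layout described above can be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: `sketch_dp_prometheus_dir` and its parameters are hypothetical names, and the real `setup_dp_prometheus_dir()` may differ in signature and behavior. It creates `base_dir/dp{i}` and writes that path into either a child-process env dict or the current process environment.

```python
import os
import tempfile


def sketch_dp_prometheus_dir(dp_id, base_dir, env=None):
    """Create base_dir/dp{dp_id} and point PROMETHEUS_MULTIPROC_DIR at it.

    Hypothetical stand-in for the PR's helper. If env is None, the current
    process environment is updated; otherwise only the given dict is changed.
    """
    target = os.environ if env is None else env
    dp_dir = os.path.join(base_dir, f"dp{dp_id}")
    os.makedirs(dp_dir, exist_ok=True)
    target["PROMETHEUS_MULTIPROC_DIR"] = dp_dir
    return dp_dir


# Prepare one isolated env per DP rank, as a launcher for two DP services might.
base = tempfile.mkdtemp()
child_envs = []
for i in range(2):
    env = dict(os.environ)  # copy, so the parent environment stays untouched
    sketch_dp_prometheus_dir(i, base, env)
    child_envs.append(env)
```

Each env dict could then be passed to `subprocess.Popen(..., env=env)`, so that every DP service writes its multiprocess metric files into its own subdirectory.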

Usage or Command

N/A

Accuracy Tests

N/A

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

@codecov-commenter

codecov-commenter commented Jun 16, 2026

Codecov Report

❌ Patch coverage is 75.67568% with 9 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@e58f31c). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| fastdeploy/engine/engine.py | 16.66% | 5 Missing ⚠️ |
| fastdeploy/engine/common_engine.py | 33.33% | 2 Missing and 2 partials ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #8059   +/-   ##
==========================================
  Coverage           ?   67.49%           
==========================================
  Files              ?      475           
  Lines              ?    66955           
  Branches           ?    10332           
==========================================
  Hits               ?    45193           
  Misses             ?    18888           
  Partials           ?     2874           

| Flag | Coverage Δ |
| --- | --- |
| GPU | 77.52% <75.67%> (?) |
| XPU | 6.95% <35.13%> (?) |

PaddlePaddle-bot

This comment was marked as outdated.

@PaddlePaddle-bot

PaddlePaddle-bot commented Jun 18, 2026

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-07-02 17:01:41 UTC+08:00

This CI report was generated from the following code (refreshed every 30 minutes):
PR commit: 06130f1 | Merge base: cbb0811 (branch: develop)


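1 Required jobs: 9/10 passed

| Total runs (reruns) | Total jobs | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Pending | Skipped |
| --- | --- | --- | --- | --- | --- | --- |
| 42(0) | 42 | 37 | 4 | 0 | 0 | 0 |

| Job | Error type | Confidence | Log |
| --- | --- | --- | --- |
| Approval | Approval required | | Job |

2 Failure details

🔴 Approval: Approval required (confidence: high)

This job requires manual approval; CI continues only after the approval is granted.

Suggested fix: obtain the manual approval.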

@PaddlePaddle-bot PaddlePaddle-bot left a comment

🤖 Paddle-CI-Agent | pr_review | 2026-07-03 19:33:30

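📋 Review summary

PR overview: Adds per-DP-rank isolation for the Prometheus multiprocess directory, covering the engine/internal-adapter and multi_api_server scenarios.
Scope of changes: fastdeploy/metrics/, fastdeploy/engine/, fastdeploy/entrypoints/openai/, and the related unit tests.
Impact tags: [Engine] [APIServer]

Issues

| Level | File | Summary |
| --- | --- | --- |
| 🔴 Bug | fastdeploy/entrypoints/openai/multi_api_server.py:112 | multi_api_server reuses a helper that moves the current process's .db files, which may mix supervisor metrics into DP0 |

📝 PR convention check

Compliant.

Overall assessment

Splitting the directory per DP is the right direction, but multi_api_server launches its services as subprocesses, which differs semantically from the engine's fork / current-process switching. The parent process's Prometheus files must not be moved when only a child-process env is being prepared; otherwise DP0 metrics will still be contaminated.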

os.makedirs(prom_dir_i, exist_ok=True)
env["PROMETHEUS_MULTIPROC_DIR"] = prom_dir_i
logger.info(f"Set PROMETHEUS_MULTIPROC_DIR for DP {i}: {prom_dir_i}")
setup_dp_prometheus_dir(i, env["PROMETHEUS_MULTIPROC_DIR"], env)

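🔴 Bug: This code builds the child-process env that is passed to subprocess.Popen, but when dp_id == 0, setup_dp_prometheus_dir() moves any existing .db files under the current process's base_dir into dp0/. The current process is the multi-API supervisor, not the DP0 service, and these files may have been created by the parent process while importing FastDeploy or initializing metrics. After the move, the DP0 child process's /metrics endpoint also collects the supervisor's .db files, so DP0 metrics remain contaminated.

Suggested fix: restrict "moving existing .db files" to the case where the current process's environment is actually being switched, for example by performing the DP0 move only when env_dict is None. Alternatively, add a move_existing parameter to the helper and pass False when calling it from multi_api_server. multi_api_server only needs to create base_dir/dp{i} and write it into the child-process env.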
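
One way to read the suggested fix is sketched below. It is an illustration under stated assumptions, not the PR's code: the function name `sketch_setup_dp_dir` is hypothetical, the parameter names `env_dict` and `move_existing` follow the review comment, and the file name `counter_123.db` is made up to simulate a file left behind by the supervisor.

```python
import os
import shutil
import tempfile


def sketch_setup_dp_dir(dp_id, base_dir, env_dict=None, move_existing=True):
    """Hypothetical helper: create base_dir/dp{dp_id} and set the env var.

    Existing .db files in base_dir are moved into dp0/ only when the caller
    asks for it (move_existing=True) and dp_id is 0.
    """
    target = os.environ if env_dict is None else env_dict
    dp_dir = os.path.join(base_dir, f"dp{dp_id}")
    os.makedirs(dp_dir, exist_ok=True)
    if move_existing and dp_id == 0:
        for name in os.listdir(base_dir):
            src = os.path.join(base_dir, name)
            if name.endswith(".db") and os.path.isfile(src):
                shutil.move(src, os.path.join(dp_dir, name))
    target["PROMETHEUS_MULTIPROC_DIR"] = dp_dir
    return dp_dir


# Simulate a supervisor that already wrote a metrics file into base_dir.
base = tempfile.mkdtemp()
supervisor_file = os.path.join(base, "counter_123.db")
open(supervisor_file, "w").close()

# Supervisor-style call: only prepare the child env, do not move parent files.
child_env = {}
dp0_dir = sketch_setup_dp_dir(0, base, child_env, move_existing=False)
```

With `move_existing=False`, the supervisor's file stays in `base_dir` and `dp0/` starts empty, so a DP0 child that collects from `dp0/` would not pick up the supervisor's metrics.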


4 participants